Fundamentals of Digital 
Engineering: 

Designing for Reliability 


Abstract 


A Micro-Course 


& 


The concept of designing for reliability will be introduced, along with a brief 
overview of reliability, redundancy, and traditional methods of fault tolerance 
as applied to current logic devices. The fundamentals of 
advanced circuit design and analysis techniques will be the primary focus. 
The introduction will cover the definitions of key device parameters and how 
analysis is used to prove circuit correctness. Basic design techniques such as 
synchronous vs. asynchronous design, metastable state resolution time/arbiter 
design, and finite state machine structure/implementation will be reviewed. 
Advanced topics will be explored, such as skew-tolerant circuit design, the 
use of triple-modular redundancy and circuit hazards, device transients and 
preventative circuit design, lock-up states in finite state machines generated 
by logic synthesizers, device transient characteristics, radiation mitigation 
techniques, worst-case analysis, the use of timing analyzers and simulators, 
and others. Case studies and lessons learned from spaceflight designs will 
be given as examples. 


Introduction 


This Seminar 

• This is a seminar, not a class 

- Two Way Conversation 

- Basic Theory 

- Lessons Learned 

- Case Studies for Discussion 

• Present Your Own Case Studies for Discussion and 
Future Inclusion 

• Under Development 

- First Time This Seminar Is Given 

- Not All Topics Are Fully Developed 

- What Areas Are Useful? Guide Development. 


Reliability 

Motivation - A Case Study (1961) 


First, I believe that this nation should commit itself to 
achieving the goal, before this decade is out, of 
landing a man on the moon and returning him safely 
to the earth. 

Special Message to the Congress on Urgent National Needs 
President John F. Kennedy 

Delivered in person before a joint session of Congress 
May 25, 1961 


Reliability 

Motivation - A Case Study (1986) 

It appears that there are enormous differences of opinion as to the 
probability of a failure with loss of vehicle and of human life. The 
estimates range from roughly 1 in 100 to 1 in 100,000. The higher 
figures come from the working engineers, and the very low figures 
from management. What are the causes and consequences of this 
lack of agreement? Since 1 part in 100,000 would imply that one 
could put a Shuttle up each day for 300 years expecting to lose 
only one, we could properly ask "What is the cause of 
management's fantastic faith in the machinery?" 

R. P. Feynman, Report of the PRESIDENTIAL COMMISSION on the Space Shuttle 
Challenger Accident, Volume 2, Appendix F - Personal Observations on Reliability of 
Shuttle, June 6th, 1986 








Reliability 

Motivation - A Case Study (2001) 

When discussing the impact of the high observed FIT 
rate for the FPGAs, the IAT asked Lockheed Martin 
“What's the reliability allocation?” Lockheed Martin 
responded, “Hell if I know.” 

The IAT followed up by stating that it appeared that 
there has been no calculation of the probability of 
mission success. Lockheed Martin concurred and JPL 
added: “No programmatic requirement for reliability 
numbers.” 

From the Mars Odyssey FPGA Independent Assessment Team, April 2, 2001. 


Increasing Reliability 

Fault Prevention 

- Eliminate Faults 

- In Practice, Reduce Probability of Failure to an 
Acceptable Level 

Fault Tolerance 

- Faults Are Expected 

- Use Redundancy 

• Additional Hardware, Software, Time 


Conventional Techniques for Highly 
Reliable Spaceborne Digital Systems 

Use of Conservative Design Practices 
- Derating, Simplicity, Wide Tolerances 
Parts Standardization 

100% Screening of Parts and Assemblies, Including Thorough 
Burn-in 

Detailed Laboratory Analyses and Corrective Action for All 
Failed Parts 

Use of Extreme Care in Manufacture of Parts 

Thorough Qualification of Parts and Manufacturing Processes 


Conventional Techniques for Highly 
Reliable Spaceborne Digital Systems 

(cont’d) 

• Thermal Cycling and Vibration Testing of All Completed 
Assemblies 

• Establishment of an efficient field service feedback system to 
report on equipment failures in the field 

• Design of the Equipment to Minimize Stress During 
Assembly and to Facilitate Replacement of Failed 
Components 


NASA Space Vehicle Design Criteria (Guidance and Control), 
Spaceborne Digital Computer Systems, SP-8070, 
March 1971 


What We Will Do 

Cover Basic Concepts 

Present Data and Design Techniques 

Case Studies 

- Solutions for Previous Missions 

- Mistakes from Previous Missions 


What We Will Not Do 

Provide Exhaustive Coverage 

- We only have a few hours 

- Too much material 

Solve All Problems 

- Goal is to make you think 

Discuss “Mom and Apple Pie” [well, at 
least minimize it] 











Termination of Special Pins 

A Very Basic Topic But A Source of 
Frequent Failures and Problems 


Special Pins 

• MODE pin (test program mode). 

• V_PP pin (programming voltage). 

• TRST* (Reset to JTAG TAP controller) 

• TCLK (provides clock to TAP controller) 

• SDI, DCLK (varies for each device type) 

• Others 


MODE Pin 

• Left Floating 

- Device can be non-functional 

- High currents 

- Uncontrolled I/O 

• Tied High During Test 

— Working device stopped functioning 

- Power supply rise time key 




IEEE JTAG 1149.1 TCLK 



The CLK pin may turn into an output driving low, clamping 
the oscillator’s output at a logic ‘0’. The TAP controller 
cannot reset and restore I/O operation. Most FPGAs do not have 
the optional TRST* pin. Note that TRST*, when present, has a 
pull-up. 






System Logic 






Input Stages - Introduction 

• Most CMOS inputs have rise/fall time limits 

- Most inputs also have some hysteresis 

• Typical symbols in specifications 

t_R, t_TLH - rise time 

t_F, t_THL - fall time 

t_T - transition time 

• Waveform measurement 

- typically from 10% to 90%, but not always 

- sometimes the parameter measurement method is not 
specified 


Input Stages - Practice 

• Data sheets may list a parameter for 
information only and not 100% tested 

• Laboratory devices have shown that not all 
qualified devices will meet the data sheet 

- One case was when a part was shrunk 

- Migration to a faster process 

- Oscillations observed 

• Conservative margins recommended 


Input Stages - Termination 

• Floating CMOS inputs are, in general, ‘bad.’ 

- Totem-pole currents, oscillations, etc. 

• Some devices offer pull-up/down resistors 

- SX-S only active during power transitions 

- Xilinx resistors controlled by SRAM 

- Care on internal tri-state lines 

• Dedicated Inputs 

- Actel unused inputs were handled by s/w 

- Not true for some SX, SX-S clocks 
=> Check each case carefully 


Input Stages - Termination 

Case Study: SX-S Clock Pin 






Input Transition Times 

Part Number                      Reference     t_T (ns) 

A10[illegible]                   1             500 
A1020A                           2             390 
A1020B                           3             500 
RH1020                           4             500 
A1280                            2             590 
A1280A                           4             590 
RH1280                           4             500 
Act 3 - [illegible] µm (5V)      [illegible]   500 
Act 3 - [illegible] µm (3.3V)    6             [illegible] 
RT54SX16, 32                     7             50 
A54SX-A (32, 72) 
RT54SXS 
XQR4000XL                        10            250 
Virtex                           11            250 
UT22VP10                         12            ? 
AT6010 (MIL)                     13            50 
AT6010 (3.3 V)                   14            50 
Quicklogic 


Input Transition Times 
References 

[1] ACT™ 1 Field Programmable Gate Arrays, March 1991. 
[2] ACT 1 and ACT 2 Military FPGAs, April 1992. 
[3] ACT™ 1 Series FPGAs, April 1996. 
[4] Radiation Hardened FPGAs, v3.0, January 2000. 
[5] ACT™ 2 Series FPGAs, April 1996. 
[6] Accelerator Series FPGAs - ACT™ 3 Family, September 1997. 
[7] 54SX Family FPGAs RadTolerant and HiRel, Preliminary V1.5, March 2000. 
[8] HiRel SX-A Family FPGAs, Advanced v.1, April 2000. 
[9] RT54SX-S RadTolerant FPGAs for Space Applications, Advanced 0.2, November 2000. 
[10] QPRO XQR4000XL Radiation Hardened FPGAs, DS071 (v1.1), June 25, 2000. 
[11] QPRO™ Virtex™ 2.5V Radiation Hardened FPGAs, DS028 (v1.0), April 25, 2000, Advance Product Specification. 
[12] Not in data sheet. 
[13] Configurable Logic Data Book, Atmel, August 1995. 
[14] AT6000LV, Atmel, October 1999. 


Clock Transition Time Specification 

A Difficult Case 


[Data sheet excerpt: AC Electrical Characteristics Over the Operating Temperature Range 
(Read and Write Cycle Timing), showing the transition time parameter t_T] 


Transition Time Requirements 

Implications - Pullup Resistors 

Often used for tri-state or bi-directional 
busses 

Rise time (10% - 90%) = 2.2 τ = 2.2 RC 
Example 

C = 50 pF 

R = 10 kΩ (keep power levels reasonable) 
τ = RC = 500 ns, so the rise time is about 1.1 µs 

=> violates many devices' specifications (see table) 
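
The pull-up arithmetic above can be checked with a short Python sketch. It assumes an ideal single-pole RC network (no driver resistance, no trace inductance); the function names are illustrative and are not taken from the slides.

```python
import math

def rc_rise_time_10_90(r_ohms: float, c_farads: float) -> float:
    """10%-90% rise time, in seconds, of an ideal single-pole RC network.

    v(t) = V * (1 - exp(-t / RC)), so t90 - t10 = RC * ln(0.9 / 0.1),
    which is the familiar "2.2 RC" rule of thumb.
    """
    return r_ohms * c_farads * math.log(0.9 / 0.1)

def max_pullup_for_limit(t_limit_s: float, c_farads: float) -> float:
    """Largest pull-up resistance (ohms) that still meets a rise time limit."""
    return t_limit_s / (c_farads * math.log(9.0))

if __name__ == "__main__":
    r, c = 10e3, 50e-12                      # values from the slide example
    print(f"tau       = {r * c * 1e9:.0f} ns")
    print(f"rise time = {rc_rise_time_10_90(r, c) * 1e9:.0f} ns")
    # Hypothetical limits, chosen to match entries in the transition time table
    for limit_ns in (500, 250, 50):
        r_max = max_pullup_for_limit(limit_ns * 1e-9, c)
        print(f"limit {limit_ns:3d} ns -> R <= {r_max:.0f} ohms")
```

Under these assumptions, a 50 pF bus needs a pull-up of roughly 4.5 kΩ or less to meet a 500 ns limit, and far less for the faster families, which is where the power trade-off noted on the slide appears.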


Transition Time Requirements 

Implications - Filters and Protection Circuits 

Often used on signals 

- Elimination of noise 

- ESD protection 

- Etc. 

RC filters or clamps (high C) can often 
substantially degrade transition times 
Consider discrete hysteresis buffers, 
particularly for clock signals 


Bus Hold Circuit in an FPGA 




Supplies leakage 
current only. 


Transition Time Requirements 

Implications - Interfacing with older logic 
families 

Case Study (1) 

- CD4000B CMOS NOR gate 
- V_DD = 5 V 

- t_T (typ) = 100 ns 
Case Study (2) 

- CD4050B (used as a level shifter, for example) 
- V_DD = 5 V 

- t_T (max, 25 °C) = 160 ns 












Transition Time Requirements 

Implications - Interfacing with older logic 
families (cont'd) 

Case Study (3) - 54HC00 CMOS NAND gate 

- 5962-8403701 VDA, NAND GATE, QUAD 2-INPUT 



Test: Transition time, output rise and fall 1/ 
Conditions: -55°C ≤ T_C ≤ +125°C unless otherwise specified; C_L = 50 pF; see figure 4 

Conditions             V_CC      Limit (Max)   Unit 

T_C = +25°C            2.0 V     75            ns 
                       4.5 V     15            ns 
                       6.0 V     13            ns 

T_C = -55°C, +125°C    2.0 V     110           ns 
                       4.5 V     22            ns 
                       6.0 V     19            ns 

1/ Transition time, if not tested, shall be guaranteed to the specified limits in 
table 1. 




From: Figure 4, 5962-8403701 VDA, NAND GATE, QUAD 2-INPUT 


Transition Time Requirements 
Case Study: RH1020 

Production Parts 

- Input stage was modified for clock upset 

V_CC = +5 VDC 
T = 25°C 

CLKBUF monitored on output 

- Because of the design of the buffer, it is difficult to see effects on the input 
pin 

Used a low-impedance signal generator, triangle waveform 
Commercial specification is t_R, t_F of 500 ns 

- RH1020 did not meet this specification 

- SMD 5962-90965 does not specify this parameter 


Transition Time Requirements 
Case Study: RH1020 CLKBUF 


[Oscilloscope capture: 100 ns/div, window from -105.00 ns to 395.00 ns] 


Transition Time Requirements 
RH1020 CLKBUF @ threshold 


[Oscilloscope capture: 20.0 ns/div, window from -205.000 ns to -105.000 ns; 
measured frequency 104.000 MHz] 


Transition Time Requirements 
Case Study: RH1020 CLKBUF Notes 

Conditions: Room temp; V_CC = 5.0 V. 
Oscillations detected consistently at t_R = 
360 ns 

Sporadic output pulses at t_R = 300 ns 
Transition time requirement not symmetric 

- Oscillations detected consistently at t_F = 1.5 µs 

- Sporadic output pulses observed at t_F = 1.0 µs 


Startup Current Transient 

Xilinx Technology 

Two sets of requirements for the power-on 
transient for Xilinx XQR4000XL and Virtex 2.5V 
FPGAs. 

- Rise time 

- Current capability of the power supply. 

Noted that unlike Actel FPGAs where slower 
power supply rise times result in higher current 
values, in Xilinx devices, faster rise times result in 
higher current values. 


Startup Current Transient 

Xilinx XQR4000XL 

Rise Time 

- Slowest power supply rise time is 50 ms. Many power supplies 
can meet this specification easily. 

- Some spaceborne power supplies may have longer rise times. 

Current Levels 

- The minimum current is broken into two groups: the XQR4013-36XL 
and the XQR4062XL. Note that according to the specification, the 
values refer to commercial and industrial grade products only, with 
the transition measured from 0 VDC to 3.6 VDC. Actual currents 
may be higher than the minimums specified. 

- Note 3 in the specification states that the duration of the peak 
current level will be less than 3 ms. 


Startup Current Transient 

Xilinx Virtex 

Complete power supply requirements are not yet specified in the 
radiation hard data sheet. Some of the information is taken from the 
commercial data sheet. 

Rise Time 

- Slowest power supply rise time for this series of parts is 50 ms. 

- The fastest suggested ramp rate is 2 ms. 

• May be slow for some power supplies. The parameter measurement criterion 
on the radiation hard data sheet is from 1 VDC to 3.375 VDC. 
Current Levels 

- The data sheet only specifies a minimum required current supply for 
Virtex devices at a power supply rise time of 50 ms. 

- According to the non-military specification, it is 500 mA for commercial 
grade devices and 2 A for industrial grade parts. 

- Additionally, shorter power supply rise times will result in higher currents. 

- The duration of peak currents will be less than 3 ms. 


Startup Current Transient 

Summary: Xilinx Technology 


[Chart legend: 
- XQR4013-36XL, Commercial and Industrial Grade 
- XQR4062XL, Commercial and Industrial Grade 
- Virtex Family, Commercial Grade 
- Virtex Family, Industrial Grade] 


Start-Up Transient Study 
in the RT1280A 


An examination of the effects of 
radiation, a detailed look at the 
response of the part, annealing, and 
impacts to the board-level and system 
designs. 


[Oscilloscope capture: 500 µs/div] 


Figure 1. Startup transient after 4 krad (Si) exposure at 
1 krad (Si)/day. The left current peak is unchanged from the 
pre-irradiation measurement and remained unchanged over 
the course of this experiment. Analysis on next slide. 








Startup transient after 4 krad (Si) exposure at 1 krad (Si)/day. 
Left current peak is unchanged from the pre-irradiation measurement 
and remained unchanged over the course of this experiment. 
- This current peak is expected as the NMOSFET isolation transistors are not 
fully conducting, resulting in totem pole currents in the input circuit of the 
logic modules. 

This current level or width is not specified in either the commercial or 
military specifications. 

The 350 mA current peak on the right appears when V_CC reaches 
3.5 VDC. 

The power supply used for these tests had a rise time of < 2 msec. 
Voltage is at 1 V/div, current is at 100 mA/div. 


[Oscilloscope capture: 500 µs/div, real time, marker at 2.5000 ms] 


Figure 2. Startup transient after 5 days of room temperature, 
biased anneal, following the 4 krad (Si) irradiation. The 
radiation-induced current peak is essentially gone. Voltage is at 
1 V/div; current is at 100 mA/div. 


[Oscilloscope capture: 500 µs/div, real time, marker at 2.5000 ms] 


Figure 3. Startup transient after an additional 2 krad (Si) 
exposure at 1 krad (Si)/day for a total of 6 krads (Si). The 
radiation-induced current peak is now about 700 mA. 
Analysis on next slide. 


Startup transient after an additional 2 krad (Si) exposure at 1 krad (Si)/day 
for a total of 6 krads (Si). 

The radiation-induced current peak is now about 700 mA. 

The current draw still appears when V_CC reaches 3.5 VDC, unchanged 
from the 4 krad (Si) radiation step. 

At V_CC = 3.5 VDC, bulk capacitors on the board will hold a charge 
Q = 3.5 V x C, which provides charge in addition to that available 
from the power supply and helps to support the voltage rail. A 180 µF 
bulk capacitor will store 630 µC. 
- The charge drawn by this transient is approximately 100 µC. 

Voltage is at 1 V/div, current is at 100 mA/div. 
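
The bulk-capacitor argument above is simple charge bookkeeping, sketched below in Python. The worst-case droop figure assumes, pessimistically, that the power supply contributes nothing during the transient; that assumption and the function names are illustrative and not from the slides.

```python
def stored_charge(c_farads: float, v_volts: float) -> float:
    """Charge held on a capacitor, Q = C * V, in coulombs."""
    return c_farads * v_volts

def worst_case_droop(c_farads: float, delta_q_coulombs: float) -> float:
    """Rail droop in volts if the capacitor alone supplies delta_q (dV = dQ / C)."""
    return delta_q_coulombs / c_farads

if __name__ == "__main__":
    c_bulk = 180e-6          # bulk capacitance from the slide
    v_rail = 3.5             # rail voltage at which the transient appears
    q_transient = 100e-6     # approximate charge drawn by the transient

    q_stored = stored_charge(c_bulk, v_rail)
    print(f"stored charge    = {q_stored * 1e6:.0f} uC")
    print(f"charge margin    = {q_stored / q_transient:.1f}x")
    print(f"worst-case droop = {worst_case_droop(c_bulk, q_transient):.2f} V")
```

In a real board the supply keeps ramping during the transient, so the actual droop would be smaller than this bound.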


Effects of 28-day, biased, room temperature anneal after the 
6 krads (Si) irradiation step 

The radiation-induced current peak is now reduced to about 100 mA. 
The charge drawn by this transient is approximately 12 µC, reduced 
from approximately 100 µC immediately after the 6 krads (Si) 
exposure. 

Voltage is at 1 V/div, current is at 100 mA/div. 


[Oscilloscope capture: 500 µs/div] 


Figure 5. Effects of 100 °C, biased anneal after the 6 krads (Si) irradiation 
step and room temperature annealing. The radiation-induced startup 
current is now virtually eliminated, showing that annealing is effective. 
Voltage is at 1 V/div, current is at 100 mA/div. 

Static Hazard 

[Figure: 2:1 mux implemented by minimized Sum-of-Products] 

[Waveform: idealized matched delays] 


Static Hazard 

[Waveform: we now have a "glitch."] 


Static Hazard 

[Waveform: same waveform, zoomed in.] 


Illustrating the minimized function on a Karnaugh map. 

Only two 2-input AND gates are needed for the product terms. 
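
The static hazard in the minimized 2:1 mux can be reproduced with a small unit-delay simulation. The sketch below is written under stated assumptions: every gate (inverter, AND, OR) has exactly one time step of delay, both data inputs are held at 1, and the select line falls once. The optional third product term A·B is the textbook consensus-term remedy; the slides do not state which fix was used, so treat it as an illustration only.

```python
def simulate_mux(a: int, b: int, s_wave: list[int], use_consensus: bool = False) -> list[int]:
    """Unit-delay simulation of Y = A.S' + B.S (plus A.B if use_consensus).

    s_wave is the select input sampled once per time step. Each gate output
    at step t depends on its inputs at step t - 1.
    """
    n = len(s_wave)
    s0 = s_wave[0]
    # Start every node in the steady state implied by the first select sample.
    not_s = [1 - s0] * n
    term_a = [a & (1 - s0)] * n          # A AND (NOT S)
    term_b = [b & s0] * n                # B AND S
    term_c = [a & b] * n                 # consensus term, constant here
    extra = term_c[0] if use_consensus else 0
    y = [term_a[0] | term_b[0] | extra] * n
    for t in range(1, n):
        not_s[t] = 1 - s_wave[t - 1]
        term_a[t] = a & not_s[t - 1]
        term_b[t] = b & s_wave[t - 1]
        extra = term_c[t - 1] if use_consensus else 0
        y[t] = term_a[t - 1] | term_b[t - 1] | extra
    return y

if __name__ == "__main__":
    select = [1] * 5 + [0] * 10          # S falls at step 5
    print("minimized SOP :", simulate_mux(1, 1, select))
    print("with consensus:", simulate_mux(1, 1, select, use_consensus=True))
```

With A = B = 1 the output should stay at 1 regardless of S, yet the minimized form drops low for one step because the B·S term turns off one inverter delay before the A·S' term turns on. The redundant term holds the output high through the transition at the cost of a third AND gate.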




















Implementation Level 


Asynchronous Decoding 

Glitch Generation 


Terminal count of 
a 4-bit synchronous 
counter. 


 1001   9 
 1010  10 
 1011  11 
*1100  12 
 1101  13 
*1110  14 
*1111  15 
 0000  16 


01111111111111 -> 10000000000000 

11111111011111 -> 11111111100000 

Because of unequal propagation delays, the sequence can 
momentarily go through state 11111111111111, 
generating a glitch. 
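
Whether a given count transition can pulse an asynchronous decoder can be checked mechanically. The sketch below assumes the worst case, in which the changing bits may switch in any order, so every code that differs from the starting code only in changing bits is reachable for an instant. The function names are illustrative.

```python
def can_glitch_through(before: int, after: int, target: int) -> bool:
    """True if 'target' can appear momentarily during before -> after.

    Assumes bits that change may do so in any order; bits that do not
    change stay fixed. The target must therefore differ from 'before'
    only in bit positions that are changing.
    """
    if target in (before, after):
        return False                      # a legitimate decode, not a glitch
    changing = before ^ after
    return ((before ^ target) & ~changing) == 0

def glitchy_transitions(width: int, target: int) -> list[tuple[int, int]]:
    """Count transitions n -> n+1 of a binary counter that can glitch a decode of target."""
    size = 1 << width
    return [(n, (n + 1) % size) for n in range(size)
            if can_glitch_through(n, (n + 1) % size, target)]

if __name__ == "__main__":
    all_ones = (1 << 14) - 1
    print(can_glitch_through(0b01111111111111, 0b10000000000000, all_ones))
    print(can_glitch_through(0b11111111011111, 0b11111111100000, all_ones))
    print(glitchy_transitions(4, 0b1111))
```

For the 4-bit counter, the transitions into 1000, 1100, and 1110 can momentarily decode as 1111. That appears consistent with the starred entries 1100 and 1110 in the list above, although the slide does not say what the stars mean.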


Decoder Output Used As Clock 



Logic Design 




From Erickson, MAPLD 2000 


Designer unaware that a parallel asynchronous decode may glitch. Relied 
on back-annotated logic simulation. This construct appears repeatedly. 


Static Hazard 

Flight Design Example 


TMR Triplet Majority Voter 


High-skew buffer 


Static Hazard 

Flight Design Example 

Care is needed when using TMR circuits. First, 
the output of the voter may be susceptible to a 
logic hazard “glitch.” This is not a problem if the 
TMR is feeding the input of another synchronous 
input. However, the TMR output should never 
feed asynchronous inputs such as flip-flop 
clocks, clears, sets, read/write inputs, etc. 

"Design Techniques for Radiation-Hardened FPGAs," 

Actel Corporation, September 

- Based on "SEU Hardening of Field Programmable Gate Arrays (FPGAs) for Space 

Applications and Device Characterization," R. Katz, R. Barto, et al., IEEE Transactions 
on Nuclear Science, Dec. 













Asynchronous Clears 


Asynchronous Logic 




Synchronous vs. Asynchronous 
Logic 

• Asynchronous signals are not synchronized to a 
clock. 

• Timing Analysis for Asynchronous Circuits 

- Many tools do not support this 

- Complex, sometimes not tractable 

- Error-prone 

• Asynchronous logic may result in smaller, faster, 
or lower power circuits 

• Asynchronous logic, well done, is reliable. 


Is It Or Isn’t It? 


16 MHz high skew clock 


1 MHz low-skew clock 



Low-skew buffer 



Common Asynchronous Design Problems 

• Design may be marginal 

- Adequate margin non-verifiable 

• Aging and radiation effects 

- Can not test for these 

• Failures may occur late in the test program 

- i.e., thermal or thermal/vacuum testing 

- This is always on Friday night 

• System may have unexplained glitches 

- Often difficult to troubleshoot 







Some Examples of Problems 

• Spacecraft Experienced Inadvertent Reset 
During System Testing 

- Only from 17 to 20 °C 

- FPGAs were redesigned 

• Lots and lots of 'rookie mistakes.’ 

- No analysis and unknown margin 

- Decoded outputs used as clocks 

- High-skew signals used as clocks 

• Counters 

• Shift Registers 










Fail Safe Logic 


Orbiting Astronomical 
Observatory (OAO) 
Technology 


This ■cv.nf'ti hurt!;, -tutted, i ’f t ol tmttrul In JfKi. 


"Primary Processor and Data Storage Equipment for the Orbiting Astronomical 
Observatory," Thomas B. Lewis, IBM Corporation, Space Guidance Center, 
Owego, N.Y., IEEE Transactions on Electronic Computers, December 1963, 
pp. 677-687 


Quad Redundant AND Gate 
Orbiting Astronomical Observatory 




Quad Redundant OR Gate 
Orbiting Astronomical Observatory 




Quad Redundant Inverter 
Orbiting Astronomical Observatory 












Integrated Circuit Reliability 

Historical Perspective 

Application                        Reliability 

• Apollo Guidance Computer         < 10 FITS 
• Commercial (1971)                500 Hours 
• Military (1971)                  2,000 Hours 
• High Reliability (1971)          10,000 Hours 
• SSI/MSI/PROM 38510 (1976)        44-344 FITS 
• MSI/LSI CICD Hi-Rel (1987)       43 FITS 
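
The table mixes two units, FITs and hours. A FIT is one failure per 10^9 device-hours, so the two can be compared once a failure model is chosen. The sketch below assumes a constant failure rate (exponential model) and reads the "Hours" entries as mean time between failures; both are assumptions for illustration, since the slide does not define the 1971 figures.

```python
import math

DEVICE_HOURS_PER_FIT = 1e9   # 1 FIT = 1 failure per 1e9 device-hours

def fit_to_mtbf_hours(fit: float) -> float:
    """Mean time between failures, in hours, for a constant failure rate."""
    return DEVICE_HOURS_PER_FIT / fit

def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Failure rate in FITs for a given MTBF in hours."""
    return DEVICE_HOURS_PER_FIT / mtbf_hours

def survival_probability(fit: float, hours: float, n_devices: int = 1) -> float:
    """Probability that none of n_devices fails in 'hours' (exponential model)."""
    return math.exp(-fit / DEVICE_HOURS_PER_FIT * hours * n_devices)

if __name__ == "__main__":
    print(f"10 FIT        -> MTBF {fit_to_mtbf_hours(10):.1e} hours")
    print(f"10,000 hours  -> {mtbf_hours_to_fit(10_000):,.0f} FIT")
    # Hypothetical mission: 20 devices at 43 FIT each for 5 years
    print(f"5-year survival, 20 parts at 43 FIT: "
          f"{survival_probability(43, 5 * 8760, 20):.4f}")
```

Read this way, a 10,000 hour part corresponds to 100,000 FIT, several orders of magnitude worse than the FIT-rated entries in the same table.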








Actel FPGAs 


Technology (µm)    FITS     # Failures    Device-Hours 

2.0/1.2            33       2             9.4 x 10^7 
1.0                9.0      6             6.1 x 10^8 
0.8                10.9     1             1.9 x 10^8 
0.6                4.9      0             1.9 x 10^8 
0.45               12.6     0             7.3 x 10^7 
0.35               19.3     0             4.8 x 10^7 
RTSX 0.6           33.7     0             2.7 x 10^7 
0.25               88.9     0             1.0 x 10^7 
0.22               78.6     0             1.2 x 10^7 
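
Rows with zero observed failures still carry a nonzero FIT value because the figure is an upper confidence limit rather than a point estimate. The slide does not state how the Actel numbers were computed; the sketch below assumes a 60% upper confidence limit (the level quoted on the Quicklogic and Xilinx slides), a Poisson failure model, and no temperature acceleration factor. The bisection solver stands in for a chi-square table lookup and is an illustrative choice.

```python
import math

def poisson_cdf(n: int, mean: float) -> float:
    """P(X <= n) for a Poisson random variable with the given mean."""
    return sum(math.exp(-mean) * mean ** k / math.factorial(k) for k in range(n + 1))

def upper_limit_on_mean(n_failures: int, confidence: float) -> float:
    """Mean m such that P(X <= n_failures; m) = 1 - confidence.

    Equivalent to chi-square(confidence, 2n + 2) / 2; solved by bisection
    because the Poisson CDF decreases monotonically with the mean.
    """
    target = 1.0 - confidence
    lo, hi = 0.0, 10.0 + 10.0 * n_failures
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(n_failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def fit_upper_limit(n_failures: int, device_hours: float, confidence: float = 0.60) -> float:
    """Upper confidence limit on the failure rate, in FITs."""
    return upper_limit_on_mean(n_failures, confidence) / device_hours * 1e9

if __name__ == "__main__":
    print(f"0 failures, 7.3e7 h: {fit_upper_limit(0, 7.3e7):.1f} FIT")
    print(f"2 failures, 9.4e7 h: {fit_upper_limit(2, 9.4e7):.1f} FIT")
```

Under these assumptions the 0.45 µm and 2.0/1.2 µm rows come out close to the tabulated 12.6 and 33 FIT. The small differences suggest rounding or additional factors in the original data, so this is a plausibility check rather than the vendor's method.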


UTMC and Quicklogic 

• FPGA 

- < 10 FITS (planned) 

- Quicklogic reports 12 FIT, 60% UCL 

• UT22VP10 

UTER Technology, 0 failures, 0.3 [double check] 

• Antifuse PROM 

- 64K: 19 FIT, 60% UCL 
- 256K: 76 FIT, 60% UCL 


Xilinx FPGAs 

• XC40xxXL 

- Static: 9 FIT, 60% UCL 

- Dynamic: 29 FIT, 60% UCL 

• XCVxxx 

- Static: 34 FIT, 60% UCL 

- Dynamic: 443 FIT, 60% UCL 











Power Supply Sequencing 

Power Switching 


• Protecting I/O's 

• Powering Circuits 

• RT54SX 16/32 

- Perhaps RT54SX32S 

- UTMC buffers 

• EEPROMs/write protection 

• SMEX/WIRE 


Power Supply Sequencing 

Protecting I/O's 

• Parasitic/ESD diodes 

• PCI clamp diodes 

• cold-sparing capable I/O's 



Power Supply Sequencing 

RT54SX 16/32 


[Data sheet excerpt: power-up sequencing requirements] 


54SX Family FPGAs, RadTolerant and HiRel, v2.0, March 2001 


Power Supply Sequencing 
RT54SX32S 

• To date, our lab work has shown, on some 
parts, that when V_CCI is applied before V_CCA, 
significant currents, > 10 mA, can be seen 
flowing into the V_CCI pin. 

• Power supply sequencing may also affect 
reliability of the safe power on/off feature. 

• These are under investigation. 









Power Supply Sequencing 

EEPROMs: Hardware Write Protection 

3.11.5 Power supply sequence of EEPROMs. In order to 
reduce the probability of inadvertent writes, the 
following power supply sequences shall be observed. 

a. For device types 1-18, a logic high state shall be 
applied to WE and/or CE at the same time or before the 
application of V_CC. For device types 16-18, an 
additional precaution is available, a logic low state 
shall be applied to RES at the same time or before the 
application of V_CC. 

b. For device types 1-18, a logic high state shall be 
applied to WE and/or CE at the same time or before the 
removal of V_CC. For device types 16-18, an additional 
precaution is available, a logic low state shall be 
applied to RES at the same time or before the removal 
of V_CC. 


Power Supply Sequencing 

EEPROMs: Software Write Protection 
To protect against unintentional programming caused by noise 
generated by external circuits, AS58C1001 has a Software data 
protection function. To initiate Software data protection mode, 

3 bytes of data must be input, followed by a dummy write cycle 
of any address and any data byte. This exact sequence switches 
the device into protection mode. This 4th cycle during write is 
required to initiate the SDP and physically writes the address and 
data. While in SDP the entire array is protected in which writes 
can only occur if the exact SDP sequence is re-executed or the 
unprotect sequence is executed. 

The Software data protection mode can be cancelled by inputting 
the following 6 Bytes. This changes the AS58C1001 to the Non- 
Protection mode, for normal operation. 

AS58C1001 128K x 8 EEPROM, Austin Semiconductor, Inc. 


Power Supply Sequencing 

EEPROMs: Software Write Protection 


Power Supply Sequencing 

SMEX/WIRE 


sable Protection 



System applied power simultaneously to the 
FPGA, drive circuitry, and relay. 

Control FPGA generated both ARM and 
FIRE signals based on spacecraft opto- 
isolated inputs. 

Transient analysis not performed. 

Saved 1 relay. 











Definitions 


Redundancy 


• Simplex 

- Single Unit 

• TMR or NMR 

- Three or n units with a voter 

• TMR/Simplex 

- After the first failure, a good unit is switched 
out with the failed unit. 

• TMR/Switchable Spare 

- After the second failure is detected, the last 
good unit is switched in. 


Types of Redundancy 

• Static Redundancy 

• Dynamic Redundancy 

• Hybrid Redundancy 


Static Redundancy 

• Uses Extra Components 

• Effect of a Fault is Masked Instantaneously 

• Two Major Techniques 

- N-Modular Redundancy (generalization of TMR 
or Triple Modular Redundancy) 

- Error Correcting Codes 

Static Redundancy 

• TMR flip-flops 

• What happens when you add a Hamming code 
and error correct to a finite state machine? 

- Hint: Are SEUs synchronous? 


TMR/Voter Structures 


With no active clock, it's an SEU integrator. 
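
The "SEU integrator" remark can be illustrated with a small model. The sketch assumes a TMR flip-flop whose voted output is fed back to reload all three flip-flops on every clock edge; the original figure is not recoverable, so that structure is an assumption, as are the function names and the upset schedule.

```python
def majority(a: int, b: int, c: int) -> int:
    """Two-out-of-three majority vote on single bits."""
    return (a & b) | (a & c) | (b & c)

def first_output_error(upsets: list[tuple[int, int]], steps: int, clock_active: bool):
    """Return the step at which the voted output first goes wrong, or None.

    upsets is a list of (step, flip_flop_index) events. The stored value is 1.
    With an active clock, each step reloads all three flip-flops from the
    voter, which scrubs any single upset. With the clock stopped, upsets
    simply accumulate.
    """
    flops = [1, 1, 1]
    for step in range(steps):
        for when, which in upsets:
            if when == step:
                flops[which] ^= 1        # single event upset flips one flip-flop
        if majority(*flops) != 1:
            return step
        if clock_active:
            voted = majority(*flops)
            flops = [voted, voted, voted]
    return None

if __name__ == "__main__":
    events = [(2, 0), (7, 1)]            # two upsets, five steps apart
    print("clock running:", first_output_error(events, 10, clock_active=True))
    print("clock stopped:", first_output_error(events, 10, clock_active=False))
```

With the clock running, the two widely spaced upsets are each corrected before the next arrives. With the clock stopped, the second upset finds the first still in place and the voter output flips.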









Static Redundancy Example 

SEU-Hardened Flip-Flop 



Dynamic Redundancy 

Uses Extra Components 

Only 1 Copy Operates At A Time 

- Fault Detection 

- Fault Recovery 

Spares Are On “Standby” 

- Hot Spares 

- Cold Spares 


Hot and Cold Spares 

Hot Spares 

- Modules/components are powered or ‘hot’ 

Cold Spares 

- Modules/components have their power removed 
or are ‘cold’ 

- Sneak path analysis is necessary, particularly with 
CMOS interfaces 

• Some CMOS I/O structures are high-impedance when 
powered down 


Interfacing - Blocks 



ESD and parasitic diodes (not shown here) to the power 
bus (present in most CMOS devices) form a sneak path. 


[Figure labels: Backplane; Active Bus or Backplane] 


Types of Redundancy 

• Classified on how the redundant elements are 
introduced into the circuit 
• Choice of redundancy type is application specific 
• Active or Static Redundancy 

- External components are not required to perform the 
function of detection, decision and switching when an 
element or path in the structure fails. 

• Standby or Dynamic Redundancy 

- External elements are required to detect, make a decision 
and switch to another element or path as a replacement 
for a failed element or path. 










Redundancy Techniques 

• Active 
- Parallel: Simple (1), Duplex (2), Bimodal (3) 
- Voting: Majority Vote, Simple (4) and Adaptive (5); Gate Connector (6) 

• Standby 
- Non-Operating (7) 
- Operating (8) 


Simple Parallel Redundancy 

Active - Type 1 



In its simplest form, 
redundancy consists of a 
simple parallel combination 
of elements. If any element 
fails open, identical paths 
exist through parallel 
redundant elements. 


Duplex Parallel Redundancy 

Active - Type 2 



This technique is applied to 
redundant logic sections, such as 
A1 and A2 operating in parallel. It 
is primarily used in computer 
applications where A1 and A2 can 
be used in duplex or active 
redundant modes or as a separate 
element. An error detector at the 
output of each logic section 
detects noncoincident outputs and 
starts a diagnostic routine to 
determine and disable the faulty 
element. 


Bimodal Parallel Redundancy 

Active - Type 3 


(a) Bimodal Parallel/ 
Series Redundancy 



(b) Bimodal Series/ 
Parallel Redundancy 



A series connection of parallel 
redundant elements provides 
protection against shorts and 
opens. Direct short across the 
network due to a single element 
shorting is prevented by a 
redundant element in series. An 
open across the network is 
prevented by the parallel element. 
Network (a) is useful when the 
primary element failure mode is 
open. Network (b) is useful when 
the primary element failure mode 
is short. 


Simple Majority Voting 

Active - Type 4 


Decision can be built into 
the basic parallel redundant 
model by inputting signals 
from parallel elements into a 
voter to compare each signal 
with remaining signals. 
Valid decisions are made 
only if the number of useful 
elements exceeds the failed 
elements. 


Adaptive Majority Voting 

Active - Type 5 


This technique exemplifies 
the majority logic 
configuration discussed 
previously with a 
comparator and switching 
network to switch out or 
inhibit failed redundant 
elements. 









Gate Connector Voting 

Active - Type 6 


Similar to majority voting. 
Redundant elements are 
generally binary circuits. 
Outputs of the binary 
elements are fed to switch- 
like gates which perform the 
voting function. The gates 
contain no components 
whose failure would cause 
the redundant circuit to fail. 
Any failures in the gate 
connector act as though the 
binary element were at fault. 


Non-Operating Redundancy 

Standby - Type 7 



A particular redundant element of a 
parallel configuration can be 
switched into an active circuit by 
connecting outputs of each element 
to switch poles. Two switching 
configurations are possible. 

1) The element may be isolated 
by the switch until switching is 
completed and power applied to the 
element in the switching operation. 

2) All redundant elements are 
continuously connected to the 
circuit and a single redundant 
element activated by switching 
power to it. 


Operating Redundancy 

Standby - Type 8 


In this application, all 
redundant units operate 
simultaneously. A sensor on 
each unit detects failures. 
When a unit fails, a switch at 
the output transfers to the 
next unit and remains there 
until failure. 


Redundant Processors 

Software Voting for the Space Shuttle 

Killingbeck - There are approaches to the instability problem that involve 
equalization and periodic exchanges of data - some kind of averaging, middle 
select, or whatever, to keep things from getting too far apart. The problem is 
that, for every sensor, an analysis has to be made of what values are reasonable 
and how an average should be picked. The extra computation consumes a lot of 
manpower and time, and creates a lot of accuracy problems. It’s very hard to 
set a tolerance level that throws away bad data and doesn’t somehow throw away 
some good data that happen to be extreme. It wasn’t so much that we felt that 
this scheme couldn't be made to work, it's just that we believe there had to be a 
better way. 


Communications of the ACM, September 1984, p. 894 


Redundant Processors 

Architecture for the Space Shuttle 

Killingbeck - We originally looked at three redundancy 
management schemes. First, we considered running as a number 
of totally independent sensor, computer, and actuator strings. This 
is a classic operating system for aircraft - the Boeing 767, for 
example, uses this basic approach. We also looked at the 
master/slave concept, where one computer is in charge of reading 
all the sensors and the other computers are in a listening mode, 
gathering information. One of the backups takes over only if the 
master fails. The third approach we considered is the one we 
decided to use, the distributed command approach, where all the 
computers get the same inputs and generate the same outputs. 

Communications of the ACM, September 1984, p. 894. 


Calculation of TMR 
Reliability for SEUs 

The probability of i arrivals in a time t is calculated as: 

$P(i, t, \lambda) = \frac{(\lambda t)^i e^{-\lambda t}}{i!}$   (1) 

Following this, the interarrival time is a continuously 
distributed exponential random variable with the average 
time between arrivals of $1/\lambda$. 

Each particular bit is modeled independently of all other 
bits. In practice, this is not always true. For instance, certain 
memory devices may have multiple upsets in a single byte 
within one address [6]. This phenomenon has not been seen in 
FPGAs. 








Calculation of TMR 
Reliability for SEUs 

The probability for a single bit not being upset can now 
be computed as the probability of an even number of arrivals 
in the scrub period and the probability for a bit being upset is 
computed as the probability of an odd number of arrivals. 

PS = Probability of Success (2)

= Probability of no upset (3)

= Probability of an even number of upsets (4)

= $P(0,t,\lambda) + P(2,t,\lambda) + P(4,t,\lambda) + \ldots$ (5)

and

PF = Probability of Failure (6)

= Probability of upset (7)

= Probability of an odd number of upsets (8)

= $P(1,t,\lambda) + P(3,t,\lambda) + P(5,t,\lambda) + \ldots$ (9)
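The even and odd sums can be evaluated in closed form, since the even Poisson terms add up to (1 + e^(-2λt))/2. The sketch below is a minimal illustration of that step, not code from the original paper; the function names and the example rate and scrub period are assumptions chosen only for the demonstration.

```python
import math

def poisson(i, rate, t):
    """P(i, t, lambda): probability of exactly i arrivals in time t."""
    return (rate * t) ** i * math.exp(-rate * t) / math.factorial(i)

def bit_upset_probabilities(rate, t):
    """Return (PS, PF) for a single bit over one scrub period t.

    PS is the sum of the even-arrival terms, PF the sum of the odd ones.
    expm1 keeps precision when rate * t is very small.
    """
    pf = -math.expm1(-2.0 * rate * t) / 2.0   # (1 - e^(-2*lambda*t)) / 2
    ps = 1.0 - pf                             # (1 + e^(-2*lambda*t)) / 2
    return ps, pf

if __name__ == "__main__":
    # Hypothetical numbers: upset rate per bit per second, scrub period in seconds.
    ps, pf = bit_upset_probabilities(rate=1.0e-7, t=3600.0)
    print("PS =", ps, "PF =", pf)
```

For small λt the result is close to PF ≈ λt, which is the usual single-upset approximation.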


Calculation of TMR 
Reliability for SEUs 

Now we have the following for each ‘word’ in memory: 

1. The word consists of n (word length) "repeated" trials.

2. Success (no upset) or failure (upset). 

3. Probability of success remains constant from bit to bit. 

4. Each bit is independent. 

which is a description of a binomial experiment. 

An experiment fails when the word contains more errors than
the code can correct. For the TMR flip-flop (n = 3), that
means either 2 or 3 upsets.


Calculation of TMR 
Reliability for SEUs 

So, $P(\text{Failure of a word}) = \sum_{i=2}^{n} P(i \text{ upsets in a word})$ (10)

where n is equal to the total word length, and

$P(i \text{ upsets in a word}) = C(n,i) \times PS^{(n-i)} \times PF^{i}$ (11)

where C(n,i) is defined as $\frac{n!}{i! \times (n-i)!}$ (12)

Once the probability of a word failing is calculated, 
multiplication by the number of words will give a failure 
rate. 
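As a worked illustration of the binomial step, the sketch below sums the terms for more upsets than the code can correct and scales by the number of words. The function names, the `correctable` parameter, and the sample numbers are assumptions for this example only.

```python
import math

def word_failure_probability(pf, n=3, correctable=1):
    """Probability that a word of n bits has more upsets than can be corrected.

    pf is the per-bit upset probability for one scrub period; for a TMR
    flip-flop n = 3 and one upset is correctable, so i runs over 2 and 3.
    """
    ps = 1.0 - pf
    return sum(math.comb(n, i) * ps ** (n - i) * pf ** i
               for i in range(correctable + 1, n + 1))

def expected_failures(pf, words, n=3, correctable=1):
    """Expected number of failed words per scrub period."""
    return words * word_failure_probability(pf, n, correctable)

if __name__ == "__main__":
    # Hypothetical values: per-bit upset probability and flip-flop count.
    print(expected_failures(pf=1.0e-4, words=1000))
```

For TMR the sum reduces to 3·PF²·PS + PF³, so the failure probability falls with the square of the per-bit upset probability.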


Simplex vs. TMR Reliability 



Reliability of Redundant Systems 
Diverse Design

Case Studies and Topics for Discussion 

Diverse Design 


• Definition 

• LEM abort computer 

• Skylab Lessons Learned 

• Space Station - ISS 

• Software 

• Shuttle Computers 

• Small Satellites, University of Surrey 


Diverse Design 


Diverse Design

Definition

In diverse design redundancy two or more components
of different design furnish the same service.

This has two advantages: it offers high protection against failures due to
design deficiencies, and it can offer lower cost if the back-up unit is a "life-
boat," with lower accuracy and functionality, but still adequate for the
minimum mission needs. The installation of diverse units usually adds to
logistic cost because of additional test specifications, fixtures, and spare parts.
This form of redundancy is, therefore, economical primarily where the back-
up unit comes from a previous satellite design, or where there is experience
with it from another source. Where there is concern about the design integrity
of a primary component, diverse design redundancy may have to be employed
regardless of cost.


Diverse Design

Case Study - LEM Abort Guidance Computer

• Main computer

- 15-bit AGC, common with the CSM

- Single string

• Not enough resources for redundancy

• TRW produced a small computer

- MARCO 4418

- 8-bit

• Limited functionality

- Put the LEM in lunar orbit



Computers Take Flight: A History of NASA's Pioneering
Diverse Design


Diverse Design

Skylab Lessons Learned

When designing redundancies into systems, consider the use of nonidentical
approaches for backup, alternate, and redundant items.

Background

A fundamental design deficiency can exist in both the prime and backup
system if they are identical. For example, the rate gyros in the Skylab
attitude control system were completely redundant systems, i.e., six rate
gyros were available, two in each axis. However, the heater elements on all
gyros were identical and had the same failure mode. Thus, there was no
true redundancy and a separate set of gyros had to be sent up on Skylab 4
for an in-flight replacement.


Diverse Design

Case Study: Space Station

• No intentional diverse design, despite
Skylab's lessons learned

- Very expensive.

• Overlap in functions between US and
Russia provides some diversity in ISS.

• Russian side has some diversity more as a
result of heritage than an objective.
Diverse Design

Topic for Discussion: Software 

• Not widely applied in software 

- Difficult to quantify expected improvement 

• N-version Programming 

- In hardware NMR, there are identical copies; in 
software NMR, independent coding. 

- Voted: Reference states “sufficiently similar.” 

• Limitation: 50% of faults in software 
control systems are in the specification 

Fault Tolerant & Fault Testable Hardware Design, P. K. Lala, Syracuse
University


Diverse Design 

Software Voting 

In the N-version programming approach a number of 
independently written programs for a given function are 
run simultaneously; results are obtained by voting upon 
the outputs from the individual programs. In general the 
requirement that the individual programs should provide 
identical outputs is extremely stringent. Therefore, in 
practice "sufficiently similar" output from each program 
is regarded as equivalent; however, this increases the 
complexity of the voters [4.54]. 

Fault Tolerant & Fault Testable Hardware Design, P. K. Lala, 1985
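The extra voter complexity that comes with "sufficiently similar" outputs can be seen in a small sketch. This is an illustrative inexact voter, not the one described in the reference; the `tolerance` parameter and the choice of returning the median of the agreeing set are assumptions.

```python
def vote(outputs, tolerance):
    """Inexact majority vote over numeric outputs from N program versions.

    Two outputs are treated as equivalent when they differ by no more than
    `tolerance`. Returns the median of the first majority group found, or
    None when no majority exists.
    """
    needed = len(outputs) // 2 + 1
    for candidate in outputs:
        agreeing = sorted(v for v in outputs if abs(v - candidate) <= tolerance)
        if len(agreeing) >= needed:
            return agreeing[len(agreeing) // 2]
    return None

if __name__ == "__main__":
    print(vote([10.00, 10.02, 13.50], tolerance=0.05))  # two versions agree
    print(vote([1.0, 2.0, 3.0], tolerance=0.1))         # no majority
```

Unlike bit-for-bit voting, the "within tolerance" relation is not transitive, so the voter has to search for a group rather than simply compare pairs, and choosing the tolerance raises the same bad-data versus extreme-data problem noted for sensor data.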


Diverse Design 

Case Study: Space Shuttle Computers 

Five Identical Sets of Computer Hardware 

- 4 run the primary software (PASS) 

• Each computer sees all I/O 

• Displays status to crew 

- 1 runs the Backup Flight System (BFS) 

• Runs during critical stages but does not control I/O 
unless engaged by the crew 

- Voting is done at the actuators (dynamic) 

- Crew provides decision making on switching 
redundancy (static) 


Diverse Design 

Case Study: Space Shuttle Computers 

DG: How do you make the system reliable?

As I mentioned, there is a fifth computer that runs the Backup Flight
System (BFS). Early on, NASA was concerned about the possibility of
a generic software problem in the PASS: what if there were a "bug" in
the PASS that brought the entire primary system down? The way they
alleviated their fears was by developing independent ascent and entry
software from a subset of the requirements they had given us. This
independent software was written by Rockwell International and resides
in the fifth computer.

The decision to engage the BFS is totally a crew function. Their
procedures identify certain situations for which the switch should be
made: for instance, loss of control, multiple consecutive failures of
PASS computers, or the infamous two-on-two split where the computers
split up into two pairs (we've never seen this occur). To date the crew
has never had to use the BFS during a mission.


Diverse Design

Case Study: Space Shuttle Computers

Some more information on this is available from _Computers in Spaceflight -
The NASA Experience_, James E. Tomayko, Wichita State University:

At first the backup flight system computer was not considered to be a
permanent fixture. When safety level requirements were lowered, some IBM
and NASA people expected the fifth computer to be removed after the
Approach and Landing Test phase of the Shuttle program and certainly after
the flight test phase (STS-1 through 4). However, the utility of the backup
system as insurance against a generic software error in the primary system
outweighed considerations of the savings in weight, power, and complexity to
be made by eliminating it [104].

[104] A. D. Aldrich, "A Sixth GPC On-Orbit," Memorandum, Johnson Space
Center, Houston, TX, October 13, 1978, JSC History Office.


Diverse Design

Case Study: Small Satellites/Surrey

• Components: risk inherent in the use of components which
are not formally "space qualified"

• New technologies: employed alongside flight-proven
technologies in a "layered architecture"

- Top-layer systems use state-of-the-art high-performance device
types

- Lower-layer systems use device-types which have been flown and
tested in previous spacecraft, and which are able to carry out most
of the same functions, albeit with a possible loss of performance

• Layered architecture protects against design faults.








Diverse Design 

Case Study: Small Satellites/Surrey 

From the "Design Philosophy" section:

Recognising the risk inherent in the use of components which are not formally
"space qualified", we use redundancy at many levels to reduce the risk of total
mission failure. When adopting new technologies, we employ them alongside
flight-proven technologies in order to reduce risk. Thus we build a "layered
architecture", in which each successive layer relies on different systems
comprising increasingly well-proven technologies. The top-layer systems use
state-of-the-art high-performance device types - often without flight-heritage -
but which give a high degree of functionality. Whereas the lower-layer
systems use device-types which have been flown and tested in previous
spacecraft, and which are able to carry out most of the same functions, albeit
with a possible loss of performance. In this way, problems caused by an
inherent system design fault, or by the failure of a particular device-type, are
not duplicated in the different layers.







Configuration Control 

Sample Schematic - Further Detail 

Sources of skew include routing
Clock Skew


Clock Skew 


Normal Routing Resource 


Shift register is given as an example. Also seen 
in counters and other logic structures. 


Clock Skew 


• Clock trees are made to increase fanout.

• Not placing buffers and flip-flops on the same row 

- Can increase skew problem. 


Clock Skew - Timing 

Model 

Figure labels: FF1, FF2, T_CQ, T_ROUTE, T_H

• Hold time at FF2 is the concern.

- Worst-case

- Low V_IH FF1

- Hi V_IH FF2

- Fast T_CQ, T_ROUTE

- High T_SKEW
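The worst-case conditions listed above correspond to the standard hold-time inequality at FF2: the earliest data arrival (minimum T_CQ plus minimum T_ROUTE) must not precede the latest FF2 clock edge (maximum T_SKEW) plus the hold time T_H. The following is a minimal sketch of that check; the function name and the example delays are hypothetical, not values from any data sheet.

```python
def hold_margin(t_cq_min, t_route_min, t_hold, t_skew_max):
    """Hold-time margin at FF2, in the same time unit as the inputs.

    Data launched by FF1 arrives no earlier than t_cq_min + t_route_min.
    FF2's clock may arrive up to t_skew_max later than FF1's, and FF2 needs
    its input stable for t_hold after that edge. A negative result means
    the new data can race through and violate hold time.
    """
    return (t_cq_min + t_route_min) - (t_hold + t_skew_max)

if __name__ == "__main__":
    # Hypothetical delays in nanoseconds.
    print(hold_margin(t_cq_min=1.0, t_route_min=0.5, t_hold=0.0, t_skew_max=2.0))
    print(hold_margin(t_cq_min=1.0, t_route_min=0.5, t_hold=0.0, t_skew_max=1.0))
```

Because the check mixes best-case data delays with worst-case skew, it cannot be closed by slowing the clock, which is why a high-skew local clock is a concern.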
Local Clock: Physical Realization

Design Strategy (2)

Use of Local, High-Skew Clock 
• This project had a design rule of no more than 5 loads
on a local, high-skew clock. This was repeatedly violated.




























Self-Test: Processors 


Processor Hardware Self-Test 

Typically, a self-test program for checkout or restarting is 
a boot-strapping procedure which begins with the 
verification of the most elementary set of instructions, 
i.e., those which rely on only a fraction of the computer 
hardware in order to operate. These instructions are then 
used to construct a decision-making subroutine which 
verifies some primitive condition on a YES-NO basis. 
Once verified, this subroutine (or several similarly 
constructed) is used to check all other instructions and 
variations in sequence, beginning with the next least 
complex instruction and working up to the most complex 
instruction. After all instructions are verified, 
input/output (I/O) and memory self-test programs check 
the remaining hardware. 


Processor Hardware Self-Test 

Case Study: Gemini 

Self-test routines are also important for detecting 
malfunctions during operation. In the Gemini 
project, for example, diagnostic subroutines were 
interleaved in the operational computer program. 
When they detected a fault, a discrete command 
was issued to light a malfunction indicator lamp 
on the control panel. The circuit had a manual 
reset capability to test whether it was set by a 
transient malfunction. 


Processor Hardware Self-Test 

Case Study: Gemini (cont’d) 

Three self checks were performed during flight: 

• A timing check, based on the noncoincidence of certain
signals within the computer under proper timing conditions.

• A thorough diagnostic test which exercised all of the
computer's arithmetic operations during each computer
cycle in all modes.

• A looping-check, to verify that the computer was following
a normal program loop. A counter in the output processor
was designed to overflow every 2.75 sec. Each program
was written to reset this counter every 2.7 sec; thus, any
change in the program flow would cause an overflow and
indicate a malfunction.
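The looping-check is what is now usually called a watchdog timer. A minimal simulation of the idea, using the 2.75 sec overflow and 2.7 sec reset figures from the slide, is sketched below; the function name, the 50 ms step size, and the integer-millisecond model are assumptions for illustration.

```python
def run_program(reset_period_ms, duration_ms, overflow_ms=2750, step_ms=50):
    """Simulate a free-running counter that the program must reset in time.

    Returns True if the counter overflows (malfunction indicated) before
    duration_ms elapses, False otherwise. Times are integer milliseconds
    to avoid floating-point accumulation error.
    """
    counter = 0
    since_reset = 0
    for _ in range(0, duration_ms, step_ms):
        counter += step_ms
        since_reset += step_ms
        if counter >= overflow_ms:
            return True                 # overflow: program flow has changed
        if since_reset >= reset_period_ms:
            counter = 0                 # normal loop resets the counter
            since_reset = 0
    return False

if __name__ == "__main__":
    print(run_program(reset_period_ms=2700, duration_ms=60000))  # healthy loop
    print(run_program(reset_period_ms=3000, duration_ms=60000))  # loop runs late
```

The 50 ms margin between reset period and overflow period is what makes the check sensitive to small changes in program flow.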


Processor Hardware Self-Test 

Case Study: Apollo Guidance Computer 

The Apollo guidance computer is equipped with a restart feature 
comprising alarms to detect malfunction and a standard initiation 
sequence which leads back into the programs in progress. The 
AGC has six malfunction detection devices that cause a restart, as 
follows: 

• A parity test of each word read from memory. An odd-
parity bit is added to each fixed-memory word at
manufacture time and to each erasable word at write time.

• A looping check much like the one on Gemini. A specified 
register must be periodically tested by any correctly 
operating program. This register is "wired" and if it is not 
tested often enough will cause restart. 


Processor Hardware Self-Test 

Case Study: Apollo Guidance Computer 

• A transfer control trap, which detects endless loops 
containing only control transfer instructions, such as a 
location L which contains the instruction "transfer control 
to location L.” 

• An oscillator fail check caused by stopping of the timing 
oscillator. 

• Voltage fail circuits to monitor the 28-, 14-, and 4-V power 
levels which drive the computer. 

• An interrupt check, which detects excessive time spent in 
the interrupt mode, or too much time spent between 
interrupts. 
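The parity test in the list above can be sketched in a few lines. This is a generic odd-parity illustration, not the AGC's actual word format; the function names and the placement of the parity bit in the least significant position are assumptions.

```python
def odd_parity_bit(word):
    """Parity bit that makes the total number of 1s (word plus parity) odd."""
    return 0 if bin(word).count("1") % 2 == 1 else 1

def store(word):
    """Append the odd-parity bit as the least significant bit (assumed layout)."""
    return (word << 1) | odd_parity_bit(word)

def parity_ok(stored):
    """A stored word passes the test when it contains an odd number of 1s."""
    return bin(stored).count("1") % 2 == 1

if __name__ == "__main__":
    stored = store(0b101)
    print(parity_ok(stored))            # word as written
    print(parity_ok(stored ^ 0b100))    # same word with one bit flipped
```

One property of odd parity worth noting is that an all-zeros readout, such as from a dead memory, fails the test.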








Processor Hardware Self-Test 

Case Study: Saturn V Launch Vehicle 

• Logic used TMR

- Disagreement detector for faults

- Switches to simplex if fault detected. (1)

• Memory was dual-redundant with parity

- Both memories read in parallel

- If fault, then backup memory read, correct data
written to both memories (DRO core)

- Switch prime and backup units

(1) Need to verify from a second source.


Processor Hardware Self-Test 

Case Study: Saturn V Launch Vehicle 


Figure: Saturn V LVDC Duplex Memory Diagram, Self-Correcting Duplex Logic
(labels: Error Detect Logic, Buffer Register B, From Processor)


Processor Hardware Self-Test 

Case Study: Saturn V Launch Vehicle 


Figure: Saturn V LVDC TMR Logic (legend: Voter, Disagreement Detector)
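The two elements in the figure legend can be expressed as Boolean functions. The sketch below is a generic bitwise majority voter and disagreement detector, not the LVDC circuit itself; the function name and the use of integers as bit vectors are assumptions.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote of three redundant values.

    Returns (voted, disagreement). Each set bit in `disagreement` marks a
    bit position where the three channels were not unanimous, which is the
    information a disagreement detector reports.
    """
    voted = (a & b) | (b & c) | (a & c)
    disagreement = (a ^ b) | (b ^ c)
    return voted, disagreement

if __name__ == "__main__":
    voted, flagged = tmr_vote(0b1010, 0b1010, 0b1110)
    print(bin(voted), bin(flagged))
```

The voter masks the fault while the detector records it; without the detector, a first failure would go unnoticed until a second one defeated the vote.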


Processor Hardware Self-Test 

Case Study: Space Shuttle 

• 4 of the 5 identical computers operate in an 
NMR configuration 

- Computers synchronized and outputs between 
computers are compared on the I/O busses 

• Voting at the actuator 

- hydraulic voting mechanism: force-fight voter 

• After two failures, operates as a duplex 
system with comparison and self-test 
techniques 


Case Study: Lockstep Operation 


Processor Hardware Self-Test 

Case Study: MA31750/MIL-STD-1750A 


• On-chip parity generation/checking

• Built-In Test

- Part of initialization

- Manufacturer defined XIO Instruction

- For Tracor RHEC and MAS281

• BIT part of initialization

• Called using Built-In Function (BIF) 4F











Processor Hardware Self-Test 
Case Study: MA31750/MIL-STD-1750A 

Built-In Test (BIT) Coverage 

• Temporary Registers (T0-T11)

• General Registers (R0-R15)

• Flags Block

• Sequencer Operation and ROM Checksum

• Divide Routine Quotient Shift Network

• Multiplier and ALU

• Barrel Shift Network

• Interrupts and Fault Handling and Detection

• Address Generator Block 

• Instruction Pipeline 


Hardware Self-Check 
Case Study: IA-64 

• L2 and L3 are ECC protected

- L2 is on-chip, 96 kB unified, 6-way set associative, 64-byte line

- L3 is on-cartridge, up to 4 MB, 4-way set associative, 64-byte line

• "The processor implements a machine check
architecture (MCA) that provides the ability to
Continue, Recover, or Contain detected errors.
All significant structures on the chip are
protected by parity or ECC."


"The First IA-64 Microprocessor," S. Rusu and G. Singer, IEEE Journal
of Solid-State Circuits, November 2000


Processor Hardware Self-Test 
Case Study: MAS281/MIL-STD-1750A 

Built-In Test (BIT) Coverage 

Microcode sequencer; IB Register Control; Barrel Shifter; Byte 
Operations and Flags 

Temporary Registers (T0-T7); Microcode Flags; Multiply; 
Divide 

Interrupt Unit - MK, PI, FT; Enable/Disable Interrupts 
Status Word Control; User Flags; General Registers (R0-R15)
Timer A; Timer B 


Hardware Self-Test 
Case Study: MIL-STD-1553B 


• Mode Code 00011 - Initiate Self-test

• Terminal fail-safe. Hardware ensures that
no transmission is greater than 800.0 µs
(4.4.1.3)

• Listening to the transmitted signal to ensure
it matches what was sent.

(Look up to see if 1553 requirement or
implementation)








Metastable States


Metastability - Introduction

• Can occur if the setup (t_SU), hold time (t_H), or clock pulse
width (t_PW) of a flip-flop is not met.

• A problem for asynchronous systems or events.

• Can be a problem in synchronous systems.

• Three possible symptoms:

- Increased CLK -> Q delay.

- Output a non-logic level.

- Output switching and then returning to its original state.

• Theoretically, the amount of time a device stays in the
metastable state may be infinite.

• Many designers are not aware of metastability.


Metastability 

• In practical circuits, there is sufficient noise to move the
device output out of the metastable state and into one of the
two legal ones. This time cannot be bounded. It is
statistical.

• Factors that affect a flip-flop's metastable "performance"
include the circuit design and the process the device is
fabricated on.

• The resolution time is not linear with increased circuit time
and the MTBF is an exponential function of the available
slack time.
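The exponential dependence on slack time is usually written as MTBF = exp(t_slack / τ) / (T_W · f_clk · f_data), where τ and T_W are device-specific constants obtained from characterization. The sketch below uses that commonly quoted form; the parameter names and all numeric values are assumptions for illustration, not data for any particular device.

```python
import math

def metastability_mtbf(t_slack, tau, t_window, f_clk, f_data):
    """Mean time between synchronizer failures, in seconds.

    t_slack  : time available for the flip-flop to resolve (s)
    tau      : resolution time constant of the flip-flop (s)
    t_window : width of the susceptible window around the clock edge (s)
    f_clk    : sampling clock frequency (Hz)
    f_data   : rate of asynchronous data transitions (Hz)
    """
    return math.exp(t_slack / tau) / (t_window * f_clk * f_data)

if __name__ == "__main__":
    # Hypothetical constants; real values must come from device characterization.
    base = metastability_mtbf(5e-9, 0.2e-9, 0.1e-9, 25e6, 1e6)
    more = metastability_mtbf(6e-9, 0.2e-9, 0.1e-9, 25e6, 1e6)
    print(base, more, more / base)
```

In this example, one extra nanosecond of slack multiplies the MTBF by exp(5), roughly 150, which is why adding resolution time is far more effective than reducing the event rate.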


Metastability 




t_w = Time window where input transition may cause a metastable condition
t_su = Actual clock setup time for flip-flop
t_co = Actual flip-flop propagation delay
t_m = Metastability resolution time




Flip-Flop Timing: RT54SX-S 
Metastable State:
Possible Output from a Flip-flop 
Metastable State:

Possible Outputs from a Flip-flop 


Metastability - Calculation 


















Lockup States 

Yet Another One-Hot Implementation 


Lockup States 

A “Safe" One-Hot Implementation 


Modified one-hot state machine (reset logic omitted) for a 4-state, two-
phase, non-overlapping clock generator. A NOR of all flip-flop
outputs and the home state being encoded as the zero vector adds
robustness. Standard one-hot state machines (Q3 would be tied to the
input of the first flip-flop) have 1 flip-flop per state, with exactly one
flip-flop set per state, presenting a non-recoverable SEU hazard.

Reset flip-flops. Note second one is on falling edge
of the clock. This implementation uses 6 flip-flops.


Lockup States 

- Binary Encoding 
Lockup States

Binary Encoding 


Type StateType Is ( Home, One, Two, Three, Four );
Signal State : StateType;


Case State Is


When Others => State <= Home;

"When Others" refers to states in the enumeration, not
the physical implementation. Also, states that are not
reachable can be deleted, depending on the software and
settings.


Two Most Common Finite State 
Machine (FSM) Types 

• Binary: Smallest m (flip-flop count) with 2^m ≥ n
(state count), highest encoding efficiency.

- Or Gray Coded, a re-mapping of a binary FSM

• One Hot: m = n, i.e., one flip-flop per state, lowest
encoding efficiency.

- Or Modified One Hot: m = n - 1 (one state represented by
0 vector).
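The flip-flop counts above, and the number of unused codes a binary encoding leaves behind, can be tabulated with a short helper. This is a minimal sketch; the function and key names are assumptions.

```python
def flip_flop_counts(n_states):
    """Flip-flop count m for each encoding of an n-state FSM."""
    binary = max(1, (n_states - 1).bit_length())   # smallest m with 2**m >= n
    return {
        "binary": binary,
        "one_hot": n_states,
        "modified_one_hot": n_states - 1,
        # Codes that exist in hardware but map to no enumerated state.
        "unused_binary_codes": 2 ** binary - n_states,
    }

if __name__ == "__main__":
    print(flip_flop_counts(5))
```

For a five-state machine, binary encoding needs 3 flip-flops and leaves 3 unused codes. Those are exactly the physical states that a "When Others" clause written against the enumeration may not cover.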

Issue: How To Protect FSMs Against Transient Errors 
(SEUs and MEUs): 

• Illegal State Detection 

• Adding Error Detection and Correction (EDAC) 
Circuitry 


Many of the following slides are from: 

Sequential Circuit Design for Spaceborne and Critical
Electronics

Mil/Aero Applications of Programmable Logic Devices
(MAPLD) International Conference, 2000.
Modified One Hot FSM
Illegal State Detection 

• Error detection more difficult than for one hot

- 1 -> 0 upsets result in a legal state.

- Parity will not detect all SEUs.

- If an SEU occurs, most likely the upset will be
detectable

• Recovery from lockup sequence simple

- If all 0's (NOR of state bits), then generate a 1 to first
stage.

- If multiple 1's (more difficult to detect), then will wait
until all 1's are "shifted out."
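The recovery behavior described above can be checked exhaustively with a small model. The sketch below assumes a 4-state machine built as a 3-stage shift register whose first stage is fed by the NOR of all state bits, with the zero vector as the home state; it is an illustrative model, not the schematic from the slides, and the function names are assumptions.

```python
from itertools import product

def modified_next(bits):
    """Modified one-hot: first stage gets NOR of all bits, the rest shift."""
    first = 0 if any(bits) else 1
    return (first,) + bits[:-1]

def standard_next(bits):
    """Standard one-hot ring: last stage feeds the first stage."""
    return (bits[-1],) + bits[:-1]

def clocks_to_home(bits, limit=16):
    """Clock cycles for the modified machine to reach the zero (home) vector."""
    home = (0,) * len(bits)
    for n in range(limit + 1):
        if bits == home:
            return n
        bits = modified_next(bits)
    return None

if __name__ == "__main__":
    worst = max(clocks_to_home(s) for s in product((0, 1), repeat=3))
    print("worst-case clocks to home:", worst)
    print("standard one-hot from all zeros:", standard_next((0, 0, 0, 0)))
```

In this model every one of the eight possible vectors reaches the home state within three clocks, while the standard ring stays at all zeros indefinitely once its single 1 is lost.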


Is There a Best FSM Type, and Is It Best 
Protected Against Transient Errors By 
Circuit-Level or System-Level EDAC? 

• Circuit-level EDAC 

- Expensive in power and mass if used to protect 
all circuits 

- Can be defeated by multiple-bit transient errors 

• System-level EDAC 

- Required for hard-failure handling 

- Relies on inherent redundancy in system, high- 
level error checking, and some EDAC hardware 


System-Level Error Checking 
Mechanisms 

• Natural error checking mechanisms

- e.g., fire a thruster, check for spacecraft attitude change

• Checking mechanisms arising from multiple
subsystems

- e.g., command a module to power on, check its current
draw and temperature

• Explicitly added checking mechanisms

- Watchdog timers

- Handshake protocols for command acknowledgement

- Monitors, e.g., thruster on-time monitor


Transient Errors Cause FSM 
Jumps to Erroneous States 


Jump to 

Illegal 

state 


System-Level Error Handling 
Mechanisms Also Handle 
Transient Error Effects 


Transient Error Effect            System Response
--------------------------------  --------------------------------
Command Rejection                 Command Retry

Telemetry or Data Corruption      Data Filtering, also required to
                                  handle system noise

FSM Lock-up, e.g., detected by    Indistinguishable from hard
multiple command rejections       error


EDAC Required For Some FSMs 
Based on Criticalness of Circuit and 
Probability of Error 

Common EDAC Types

Type      Capability                      Power & Mass Impact
--------  ------------------------------  ------------------------------
Parity    Detect 1 bit error,             Extra bit, parity trees to
          correct 0                       set and check

NMR       Correct int(N/2) bit errors     Multiplies gate count by N+
          (strong correction)             and clock loading by N

Hamming   Correct 1 bit error. Detect 2   Close to TMR in gate count,
          (or more, depending on code)    much lower clock loading
          (weak correction)
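To make the Hamming entry concrete, the sketch below implements the textbook Hamming(7,4) single-error-correcting code. It illustrates the coding technique only; it does not detect double errors (that requires an added overall parity bit), and it is not taken from any flight design. Function names and the bit ordering are assumptions.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits.

    Codeword positions 1..7 hold p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(code):
    """Return (data_bits, syndrome); syndrome is the flipped position or 0."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome

if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                         # inject a single upset
    print(hamming74_correct(word))
```

Here 4 state bits cost 3 check bits, against 8 extra flip-flops for TMR of the same register. That saving in flip-flops, and hence clock loads, is the clock-loading advantage noted in the table.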




















VHDL “Interface” 


VHDL and Software Issues 


Library IEEE;
Use IEEE.Std_Logic_1164.All;

Entity Bool Is
  Port ( X : In  Std_Logic;
         Y : In  Std_Logic;
         Z : Out Boolean );
End Bool;

Library IEEE;
Use IEEE.Std_Logic_1164.All;

Architecture Bool_Test of Bool Is
Begin
  P: Process ( X, Y )
  Begin
    If ( X = Y )
      Then Z <= True;
      Else Z <= False;
    End If;
  End Process P;
End Bool_Test;


Boolean signal was mapped to different logical values in 
different versions of the same VHDL logic synthesizer 


An HDL Flow 


SEU Requirement: 

LET_th > 37 MeV-cm²/mg


Act 2 SEU Flip-Flop Data 


LET (MeV-cm²/mg)












Logic Translation/Optimization 

Implementation 

Original
Logic Replication




“Optimized” 


The two circuits are logically equivalent when analyzed with Boolean logic equations, with
the newer, CAE-optimized circuit permitting higher device speeds. An SEU analysis shows
the addition of a second state variable, with an upset resulting in the "optimized" circuit
containing a state where Q = QN, violating the system equations and causing a failure.



Delay Generation 


VHDL Code and Synthesizer Analysis 

Case Study - Hardened Clock Generator 

• The VHDL synthesizer, unknown to the
designer, generated a poor circuit for a
TMR voter

- Used 3 C-Cells for a voter

- Slowed the circuit down

• The implementation of the voter is hidden
from the user

- Synthesizer generated a static hazard

- An SEU can result in a glitch on the "hardened"
clock signal.


VHDL Code and Synthesizer Analysis 

Case Study - Hardened Clock Generator 

-- Divide 25 MHz (40 ns) clock by 4
-- to produce 6.25 MHz clock (160 ns)

-- This clock should be placed on
-- an internal global buffer

clkint1: clkint
  Port Map ( A => clk_div_cnt(1),
             Y => Clk_div4 );

clkdiv: Process (reset_n, clk)
Begin
  If reset_n = '0' Then
    Clk_div_cnt <= "00";
  Elsif clk = '1' And clk'EVENT Then
    Clk_div_cnt <= clk_div_cnt + 1;
  End If;
End Process clkdiv;
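The intended behavior of the divider, a 2-bit counter whose most significant bit is the divided clock, can be modeled cycle by cycle. This is only a behavioral sketch of the counter logic, with a hypothetical function name; it says nothing about the TMR voter or the static hazard, which arise from the synthesized gate-level implementation.

```python
def divide_by_4(n_input_edges):
    """Model a 2-bit counter clocked on rising edges; return the MSB after each edge."""
    count = 0                      # state after reset_n = '0'
    msb_trace = []
    for _ in range(n_input_edges):
        count = (count + 1) & 0b11 # 2-bit wrap-around increment
        msb_trace.append((count >> 1) & 1)
    return msb_trace

if __name__ == "__main__":
    print(divide_by_4(8))
```

The MSB repeats every four input edges, i.e. one 160 ns period for a 40 ns input clock, with a 50% duty cycle.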


VHDL Code and Synthesizer Analysis 

Case Study - Hardened Clock Generator 




Most significant bit of the counter. 3 C-Cells are used for the voter. 










Loss of Functionality 


FRAM Memory Functionality 
Loss During Heavy Ion Test 


FRAM 

DRAM - JEDEC 

JTAG 

PROM 

Microprocessor 



DRAM Modes 

DRAM Special Test and Operational Modes 

This standard defines a scheme for controlling a series of special modes for
address multiplexed DRAM. The standard defines the logic interface
required to enter, control, and exit from the special modes. In addition, it
defines a basic special test mode plus a series of other special test and
operational modes.

TEST MODES are those that implement some special test or measurement
function or algorithm designed to enhance the ability of the Vendor or User
to determine the integrity of, or to characterize, the part.

OPERATIONAL MODES are those that alter the operational
characteristics of the part but do not interfere with its function as a storage
device and are intended to be used in system operation.


JEDEC Standard No. 21-C, page 3.9.5-7, Release 4


DRAM Refresh 



Figure: DRAM block diagram (blocks: Refresh Control, Refresh Counter,
Row Addr Buffer, Col Addr Buffer, Memory Array, Column Decoder)

Adapted from: http://www.tecchannel.de/hardware/173/6.html


DRAM Refresh 


CAS#-BEFORE-RAS# REFRESH is a frequently used method of
refresh because it is easy to use and offers the advantage of a power
savings. Here's how CBR REFRESH works. The die contains an
internal counter which is initialized to a random count when the device
is powered up. Each time a CBR REFRESH is performed, the device
refreshes a row based on the counter, and then the counter is incremented.
When CBR REFRESH is performed again, the next row is refreshed and
the counter is incremented. The counter will automatically wrap and
continue when it reaches the end of its count. There is no way to reset the
counter. The user does not have to supply or keep track of row addresses.

Since CBR REFRESH uses the internal counter and not an external
address, the address buffers are powered down. For power-sensitive
applications, this can be a benefit because there is no additional current
used in switching address lines on a bus, nor will the DRAMs
pull extra power if the address voltage is at an intermediate state.

Adapted from Micron Technical Note, "Various Methods of Refresh."
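The counter behavior described in the excerpt can be modeled directly. The sketch below is a behavioral illustration only; the class name, the row count, and the seeded random start value are assumptions, and no attempt is made to model a real device.

```python
import random

class CbrRefreshCounter:
    """Internal row counter used by CAS#-BEFORE-RAS# refresh."""

    def __init__(self, rows, seed=None):
        self.rows = rows
        # Power-up value is arbitrary; the user cannot read or reset it.
        self.count = random.Random(seed).randrange(rows)

    def cbr_refresh(self):
        """Refresh the row selected by the counter, then increment and wrap."""
        row = self.count
        self.count = (self.count + 1) % self.rows
        return row

if __name__ == "__main__":
    counter = CbrRefreshCounter(rows=8, seed=1)
    print([counter.cbr_refresh() for _ in range(8)])
```

Whatever the power-up value, any run of `rows` consecutive CBR cycles visits every row exactly once. That property is why the user never needs to track row addresses, and it depends on the counter state remaining intact.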
System Logic



IEEE JTAG 1149.1 - Scan Path


The CLK pin may turn into an output driving low, clamping
the oscillator's output at a logic '0'. The TAP controller
cannot reset and restore I/O operation. Most FPGAs do not have
the optional TRST* pin. Note TRST*, when present, has a
pull-up.



IEEE JTAG 1149.1 - Scan I/O Cell



JTAG Upset Effect - Step Load 

TCK and TMS=1 Not Guaranteed Solution 


Large Step Load


Brand X SEE Test
BNL 02/98
NASA/GSFC
BB Pattern / 2 µm Epi
XI B3
Bromine

JTAG DATA PATH 


JTAG Upset Effect - Step Load 

Second Distinct Failure Mode 


JTAG Upset Effect - TCK On 


Brand X SEE Test
BNL 02/98
NASA/GSFC
BB Pattern / 2 µm Epi
XI B4
Bromine

Time (Sec)


Sample of 3 JTAG ’Upsets’ 
TCK = 6 kHz 



Sample Number (in 1000s)
(~250 nSec/Sample)












SEE Results - Loss of Functionality 

Atmel AT28C010 EEPROM, D/C 9706


Atmel AT28C010 EEPROM, D/C 9706 

Type I Errors 


LET (MeV/(mg/cm²))


Manifested by the appearance of repeated errors once the first error had been detected during ion
irradiation. Here, the first error appeared at some point in time, which was tens of reading cycles
("cycle" is defined in Section 2) after the exposure had started. Thereafter we observed one error
every few cycles.

Errors were altered bits in one word at various address locations.

Simultaneously with the observation of the first error, the device bias current increased to 26 mA
from 10 mA (normal, pre-error condition). The bias current continued to be 26 mA until the reading
process stopped. At that time, the current became 0.2 mA (quiescent level).

When the device was read again (without power-cycling), the bias current returned to 26 mA and
errors appeared again (even without the beam).

If the power to the device was shut off and re-started again (power-cycled), the device again
functioned properly (i.e., no errors).

In one instance we continued the irradiation without power-cycling for a long time, and the device
no longer showed any errors. It appeared that the affected bit underwent additional upset, returning to
the original polarity and thereby correcting the problem.


Atmel AT28C010 EEPROM, D/C 9706 

Type II Errors 

• Manifested by ”00” in all address locations, 
once the first "00" was read. 

• These errors could be removed only by 
power-cycling the device. 


Atmel AT28C010 EEPROM, D/C 9706 

Type III Errors 

• Characterized by occasional errors in a byte, 
which appeared once in many cycles. There was
no 'after-effect' for this type of error. In other 
words, one error appeared independently once in a 
while. 

• Caused by an upset in the output buffer. 


X28HC256 CMOS EEPROM 

Xicor, D/C 9140

• Upset mode which also required the cycling of
power to clear.


Loss of Functionality 
Serial PROM 

Xilinx XQR1701L 

- 10% saturated intercept at LET = 6, 1.2 x 10^-5
cm²/device


LET [MeV/(mg/cm²)]


Reference: DS062 (v3.0) February s, 2001 , 








Loss of Functionality 

Processors 

Processor simply stopped functioning without showing any observable 
bit errors 

Noticed lockup in many microprocessors including MG80C186,
MG80C286, and XC68302 

Sensitivity to lockup was essentially independent of the test programs 


urigfc \ 'vow i uucrKtrul tnUEtrup 1SOT1 • i eri.vtivS« tn ivtl’Kf 5 M>.' K ; MAI'L! 
Loss of Functionality

Processors: XC68302 Example 


LET [MeV/(mg/cm²)]




Specifications 


Specifications 
Case Study 1 

• Gate Array Operation Differed from 
Specification 

- No Continuity of Personnel on Project 

- Features Added and Deleted During 
Development 

- Changes Were Not Documented in 
Specification 


Specifications 
General Principles 


• No Specification Produced 

• Specification not Followed 

Common Error - Seen More Often Than One 
Would Expect 


Specifications 
Case Study 2 

• Continual Updates to FPGAs Caused 
Delays to Project 

- Drifting Software Requirements Impacted 
FPGA 

- Drifting System Requirements Impacted FPGA 


• No Stable Specification 








Reliance on Logic Simulators 
General Principles 

Simulators and Limitations 


• Run Time Limited 

• Number of Vectors 

• Vector Generation 

• Number of Operating Modes 

• Time for Modeling External Circuitry 

• CAE S/W Limitations 


Reliance on Logic Simulators 
Case Study 1 

• Simulator Could Only Simulate 1 ms. 

- Instrument Had a 125 ms Cycle Time. 

• Simulating All Inputs Not Practical 

- Too Many Combinations 

- Failed to Find a Logic Error Which Caused
an Arithmetic Error 


Reliance on Logic Simulators 
Case Study 2 

• FPGA Converted to ASIC 

• No Gate Level Design Review Performed at 
Any Stage 

• Test Vectors from FPGA Version Were Not 
Run on the ASIC Version 

• Test Vectors Were Capable of Detecting the 
Design Error 


Analysis vs. Simulation 

From the Project documentation: 

All ... Actel designs were re-simulated using back-annotated 
timing data, to ensure that clock skews were within proper limits. 

From Actel documentation: 

To verify that a design works properly, both the design's 
functionality and its timing must be checked. Static timing 
analysis checks timing, but not the design's functionality. 
Simulation checks the functionality of a design, but it may 
miss some timing problems. Used together, static timing 
analysis and simulation complement each other to provide 
complete design verification. 


Analysis vs. Simulation (cont’d) 

From Actel documentation: 

Both gate array and FPGA designs are susceptible to race conditions, 
which require careful analysis of setup and hold times, and clock skew 
across best-case and worst-case operating conditions. This application 
note describes how to use the Actel Timer to analyze accurately these 
types of potential timing problems. The Timer is a powerful static timing 
analysis tool that can be used successfully to check setup and hold times 
and clock skew. 

Since gate array devices are not production tested for setup and hold 
times, these parameters must be sufficiently guardbanded to guarantee 
they will never cause a failure. This is difficult when using 
backannotated timing simulation since simulation software does not allow 
best-case and worst-case timing analyses at the same time. Often such 
analysis is done by hand, if at all. In some cases, designers simply switch 
their data with the inactive edge of the clock to avoid such timing 
problems. 
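
The hand analysis described above can be sketched as a pair of slack calculations for one flip-flop-to-flip-flop path. This is a minimal sketch using the common textbook setup and hold inequalities, not the Actel Timer's algorithm. The class name, field names, and all delay values are hypothetical. Skew is defined here as capture-clock arrival minus launch-clock arrival.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathTiming:
    """One register-to-register path; all values in ns (hypothetical numbers)."""
    t_cq_min: float     # clock-to-Q delay, best case
    t_cq_max: float     # clock-to-Q delay, worst case
    t_logic_min: float  # logic plus routing delay, best case
    t_logic_max: float  # logic plus routing delay, worst case
    t_setup: float      # setup time of the capturing flip-flop
    t_hold: float       # hold time of the capturing flip-flop
    skew_min: float     # capture clock minus launch clock arrival, minimum
    skew_max: float     # capture clock minus launch clock arrival, maximum


def setup_slack(p: PathTiming, period: float) -> float:
    """Slowest data against the earliest capture edge; negative means violation."""
    return period + p.skew_min - (p.t_cq_max + p.t_logic_max + p.t_setup)


def hold_slack(p: PathTiming) -> float:
    """Fastest data against the latest capture edge; negative means violation."""
    return (p.t_cq_min + p.t_logic_min) - (p.t_hold + p.skew_max)


# ASSUMED example: a slow 20 MHz clock (50 ns period) with 2 ns of worst-case skew.
path = PathTiming(t_cq_min=1.0, t_cq_max=4.0, t_logic_min=0.5, t_logic_max=2.0,
                  t_setup=1.0, t_hold=0.0, skew_min=-0.5, skew_max=2.0)
print("setup slack (ns):", setup_slack(path, period=50.0))
print("hold slack (ns):", hold_slack(path))
```

The setup check uses worst-case delays while the hold check uses best-case delays, so a single-corner simulation cannot exercise both. In this assumed example the setup margin is large, yet the hold slack is negative, and the hold result does not depend on the clock period at all.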
Verification 


Verification Issues (1) 

• Macro generators fail 

- Expect them to be correct by construction 

- Working macro fails in later revisions 

• ex., modulo counter 



• VHDL Synthesis 

- Simulated vs. Synthesized Results 

• Latch vs. Flip-Flops 

- Lockup states in FSMs 

- Introduction of static hazards 

• No simulations or timing analysis. 
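
One of the items above, lockup states in FSMs, lends itself to a mechanical check. The sketch below is a hedged illustration, not a feature of any synthesis tool: it enumerates every encoding of the state register, including unused ones, and reports those from which a chosen home state can never be reached. The function names and the example state machine are hypothetical.

```python
from itertools import product


def lockup_states(n_bits, n_inputs, next_state, home):
    """Return the state encodings from which `home` is unreachable.

    next_state(state, inputs) must be defined for ALL 2**n_bits encodings,
    including encodings the designer never intended to use.
    """
    all_states = range(2 ** n_bits)
    all_inputs = list(product((0, 1), repeat=n_inputs))

    # Build the reversed transition graph: predecessors of each state.
    preds = {s: set() for s in all_states}
    for s in all_states:
        for x in all_inputs:
            preds[next_state(s, x)].add(s)

    # Backward search from the home state.
    can_reach_home = {home}
    frontier = [home]
    while frontier:
        target = frontier.pop()
        for s in preds[target]:
            if s not in can_reach_home:
                can_reach_home.add(s)
                frontier.append(s)
    return sorted(set(all_states) - can_reach_home)


# HYPOTHETICAL 2-bit machine: 0 = IDLE, 1 = RUN, 2 = DONE, encoding 3 unused.
def synthesized(state, inputs):
    (go,) = inputs
    if state == 0:
        return 1 if go else 0
    if state == 1:
        return 2
    if state == 2:
        return 0
    return 3  # unused encoding holds itself: a lockup state


def with_recovery(state, inputs):
    # Same machine, but the unused encoding is forced back to IDLE.
    return 0 if state == 3 else synthesized(state, inputs)


print(lockup_states(2, 1, synthesized, home=0))
print(lockup_states(2, 1, with_recovery, home=0))
```

A functional simulation that starts from reset never visits the unused encoding, so it cannot expose this problem. An upset or a power-on transient can.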


Verification Issues (2) 

• Detailed peer-review of the design is not performed 

- Designs “approved” at the CDR 

- FPGA designs not completed at the CDR 

- Management barriers to review 

- Simulation does not replace analysis 

- Testing does not replace analysis 

• Complete worst-case analysis not performed 

• Asynchronous design risks not identified, assessed 
and mitigated 


Verification Issues (3) 

• Inadequate Reviews 

- Slide flipping 

- Unskilled reviewers 

- Insufficient time 

- Findings not enforced 

• Unresolved problems 

- Glitches not fully understood 




Review Samples 

• Red Team Review 

- No Issues 

- Good FPGA design practices applied 

• NASA Civil Servant Design Engineer 

- "Oh my God!" 

• NASA On-Site Contractor Design Engineers 

- "This circuit <expletive deleted>!" 

- "Oh, <expletive deleted>. <pause> 
Oh, <expletive deleted>!" 


Design Rule Compliance 
- Violation of project clock loading rule of no more than 
5 flip-flops on a local clock. PB_OSC has 23 loads. 
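
A rule like this one is easy to check mechanically once the clock net of each flip-flop is known. The sketch below is an assumption-laden illustration: the limit of 5 and the PB_OSC load count of 23 come from the finding above, while the function name, the flat list-of-net-names input, and the other net names are hypothetical. Extracting that list from a real netlist is tool-specific and not shown.

```python
from collections import Counter

MAX_LOCAL_CLOCK_LOADS = 5  # project rule quoted in the finding above


def clock_load_violations(ff_clock_nets, global_clocks=frozenset(),
                          limit=MAX_LOCAL_CLOCK_LOADS):
    """Return {net: load_count} for local clock nets that exceed `limit`.

    ff_clock_nets: one clock net name per flip-flop instance.
    global_clocks: nets on dedicated clock routing, exempt from the rule.
    """
    loads = Counter(ff_clock_nets)
    return {net: count for net, count in loads.items()
            if net not in global_clocks and count > limit}


# HYPOTHETICAL design: only the PB_OSC count is taken from the finding.
design = ["PB_OSC"] * 23 + ["LOCAL_A"] * 4 + ["CLKA"] * 100
print(clock_load_violations(design, global_clocks={"CLKA"}))
```

Running a check of this kind on every netlist revision keeps the rule from depending on reviewer attention alone.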











Conclusion (1) 

One must understand not only 
the “how” but the “why.” 

Otherwise, failure is not a 
matter of ‘if’ but of ‘when.’ 


Conclusion (2) 

The key to developing engineering 
confidence is the rigorous identification 
of the cause for ALL failures encountered 
for ALL phases of testing ... 

Dr. Joseph F. Shea, Deputy 
Director of Manned Space Flight, 

Spaceborne Computer Engineering Conference 
October, 1962. 




